Person Detection


SwannBuddy 4K Video Doorbell review: Let the robots run the house

PCWorld

Higher-resolution video, improved overall performance, and an exciting new AI-powered voice response make Swann's second video doorbell a winner. Swann, a longtime player in the security camera world, has been spreading its wings into related smart home gear, including video doorbells. Its first SwannBuddy Video Doorbell, a lackluster release, arrived in 2022. The all-new SwannBuddy 4K Video Doorbell considerably improves on that device's resolution and image quality, resolving one of the original product's biggest shortcomings. The SwannBuddy 4K shares a familiar design with both the original SwannBuddy and most video doorbells: a large doorbell button in the center of the device, ringed with light (briefly blue, turning red when recording); a camera lens up top; and a motion sensor at the bottom.


Revisiting Adversarial Patches for Designing Camera-Agnostic Attacks against Person Detection

Neural Information Processing Systems

Physical adversarial attacks can deceive deep neural networks (DNNs), leading to erroneous predictions in real-world scenarios. To uncover potential security risks, attacking the safety-critical task of person detection has garnered significant attention. However, we observe that existing attack methods overlook the pivotal role that the camera, which captures real-world scenes and converts them into digital images, plays in the physical adversarial attack workflow. This oversight makes these attacks unstable and hard to reproduce. In this work, we revisit patch-based attacks against person detectors and introduce a camera-agnostic physical adversarial attack to mitigate this limitation. Specifically, we construct a differentiable camera Image Signal Processing (ISP) proxy network to compensate for the physical-to-digital transition gap.
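The physical-to-digital gap the authors target can be pictured with a toy ISP stage. The sketch below is a deliberately simplified, non-learned stand-in for the paper's differentiable ISP proxy network: it applies white-balance gains, gamma correction, and 8-bit quantization to raw pixel triples, and all parameter values are made up for illustration.

```python
def isp_proxy(raw, gains=(2.0, 1.0, 1.5), gamma=2.2):
    """Toy image signal processing (ISP) pipeline: white-balance gains,
    gamma correction, and 8-bit quantization, simulating the camera's
    physical-to-digital transition. A schematic stand-in for a learned
    differentiable ISP proxy network; parameters are illustrative only."""
    out = []
    for r, g, b in raw:                      # raw values in [0, 1]
        wb = (min(r * gains[0], 1.0),        # apply per-channel gain, clip
              min(g * gains[1], 1.0),
              min(b * gains[2], 1.0))
        # Gamma-encode and quantize to 8-bit, as a real ISP would.
        out.append(tuple(round((c ** (1.0 / gamma)) * 255) for c in wb))
    return out
```

A learned proxy would replace these fixed stages with trainable layers so gradients can flow from the digital image back to the physical patch.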


Panoramic Distortion-Aware Tokenization for Person Detection and Localization Using Transformers in Overhead Fisheye Images

Wakai, Nobuhiko, Sato, Satoshi, Ishii, Yasunori, Yamashita, Takayoshi

arXiv.org Artificial Intelligence

Person detection methods are used widely in applications including visual surveillance, pedestrian detection, and robotics. However, accurate detection of persons from overhead fisheye images remains an open challenge because of factors including person rotation and small-sized persons. To address the person rotation problem, we convert the fisheye images into panoramic images. For small-sized persons, we focus on the geometry of the panoramas. Conventional detection methods tend to focus on larger persons because they occupy large, significant areas in the feature maps. In equirectangular panoramic images, we find that a person's height decreases linearly near the top of the image. Using this finding, we leverage the significance values and aggregate tokens sorted by these values to balance the significant areas. In this process, we introduce panoramic distortion-aware tokenization. This tokenization procedure divides a panoramic image using self-similar figures that enable determination of optimal divisions without gaps, and we retain the maximum significance values in each tile of the token groups to preserve the significant areas of smaller persons. To achieve higher detection accuracy, we propose a person detection and localization method that combines panoramic-image remapping with this tokenization procedure. Extensive experiments demonstrate that our method outperforms conventional methods when applied to large-scale datasets.
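The fisheye-to-panorama remapping step can be illustrated with the standard equidistant fisheye model (r = f·θ). The function below, with hypothetical parameter names and no claim to match the paper's exact projection, maps a pixel in an equirectangular panorama of the hemisphere back to its fisheye source pixel:

```python
import math

def panorama_to_fisheye(u, v, pano_w, pano_h, fisheye_cx, fisheye_cy, focal):
    """Map a pixel (u, v) in an equirectangular panorama of a hemisphere
    to the corresponding pixel in an overhead fisheye image, assuming an
    equidistant projection model (r = f * theta). Parameter names are
    illustrative, not the paper's notation."""
    phi = 2.0 * math.pi * u / pano_w          # azimuth, 0 .. 2*pi
    theta = (math.pi / 2.0) * v / pano_h      # polar angle from optical axis, 0 .. pi/2
    r = focal * theta                         # equidistant radial distance
    x = fisheye_cx + r * math.cos(phi)
    y = fisheye_cy + r * math.sin(phi)
    return x, y
```

Sampling the fisheye image at these coordinates for every panorama pixel produces the remapped image; a person's angular height then translates into the near-linear pixel-height falloff described above.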


Autonomous Navigation in Dynamic Human Environments with an Embedded 2D LiDAR-based Person Tracker

Plozza, Davide, Marty, Steven, Scherrer, Cyril, Schwartz, Simon, Zihlmann, Stefan, Magno, Michele

arXiv.org Artificial Intelligence

In the rapidly evolving landscape of autonomous mobile robots, the emphasis on seamless human-robot interactions has shifted towards autonomous decision-making. This paper delves into the intricate challenges associated with robotic autonomy, focusing on navigation in dynamic environments shared with humans. It introduces an embedded real-time tracking pipeline, integrated into a navigation planning framework for effective person tracking and avoidance, adapting a state-of-the-art 2D LiDAR-based human detection network and an efficient multi-object tracker. By addressing the key components of detection, tracking, and planning separately, the proposed approach highlights the modularity and transferability of each component to other applications. Our tracking approach is validated on a quadruped robot equipped with a 270° 2D LiDAR against motion capture system data, with the preferred configuration achieving an average MOTA of 85.45% in three newly recorded datasets, while reliably running in real-time at 20 Hz on the NVIDIA Jetson Xavier NX embedded GPU-accelerated platform. Furthermore, the integrated tracking and avoidance system is evaluated in real-world navigation experiments, demonstrating how accurate person tracking benefits the planner in optimizing the generated trajectories, enhancing its collision avoidance capabilities. This paper contributes to safer human-robot cohabitation, blending recent advances in human detection with responsive planning to navigate shared spaces effectively and securely.
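The detect-then-track idea behind such a pipeline can be sketched with a minimal tracker: constant-velocity prediction plus greedy nearest-neighbour association within a distance gate. This is a generic illustration, not the paper's multi-object tracker:

```python
class SimpleTracker:
    """Minimal nearest-neighbour multi-object tracker with a
    constant-velocity motion model; an illustrative sketch of the
    general detect-then-track idea, not the paper's tracker."""

    def __init__(self, max_dist=1.0):
        self.max_dist = max_dist
        self.tracks = {}       # id -> (pos, vel); pos/vel are (x, y)
        self.next_id = 0

    def step(self, detections):
        # Predict each track one step forward with its last velocity.
        predicted = {tid: (p[0] + v[0], p[1] + v[1])
                     for tid, (p, v) in self.tracks.items()}
        assigned = {}
        unused = set(predicted)
        for det in detections:
            # Greedy nearest-neighbour association within the gate.
            best, best_d = None, self.max_dist
            for tid in unused:
                px, py = predicted[tid]
                d = ((det[0] - px) ** 2 + (det[1] - py) ** 2) ** 0.5
                if d < best_d:
                    best, best_d = tid, d
            if best is not None:
                unused.discard(best)
                old_pos, _ = self.tracks[best]
                vel = (det[0] - old_pos[0], det[1] - old_pos[1])
                self.tracks[best] = (det, vel)
                assigned[best] = det
            else:
                # No track close enough: start a new one.
                self.tracks[self.next_id] = (det, (0.0, 0.0))
                assigned[self.next_id] = det
                self.next_id += 1
        # Drop tracks that received no detection this step.
        for tid in unused:
            del self.tracks[tid]
        return assigned
```

A real-time system would add a Kalman filter, track confirmation/deletion logic, and an optimal (e.g. Hungarian) assignment instead of the greedy loop.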


Person Segmentation and Action Classification for Multi-Channel Hemisphere Field of View LiDAR Sensors

Seliunina, Svetlana, Otelepko, Artem, Memmesheimer, Raphael, Behnke, Sven

arXiv.org Artificial Intelligence

Robots need to perceive persons in their surroundings for safety and to interact with them. In this paper, we present a person segmentation and action classification approach that operates on 3D scans from hemisphere field of view LiDAR sensors. We recorded and annotated a dataset with an Ouster OSDome-64 sensor, consisting of scenes where persons perform three different actions. We propose a method based on a MaskDINO model to detect and segment persons and to recognize their actions from combined spherical projected multi-channel representations of the LiDAR data with an additional positional encoding. Our approach performs well on the person segmentation task and on estimating the person action states walking, waving, and sitting. An ablation study provides insights into the individual channel contributions for the person segmentation task. The trained models, code, and dataset are made publicly available.
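The spherical projection used to turn a dome LiDAR scan into a multi-channel image can be sketched as follows; the field-of-view bounds and image size here are illustrative, not the OSDome-64's actual specification:

```python
import math

def spherical_projection(points, rows=64, cols=1024,
                         fov_up_deg=90.0, fov_down_deg=0.0):
    """Project 3D LiDAR points onto a rows x cols range image via
    spherical coordinates, a common multi-channel representation for
    dome-style sensors. All parameters are illustrative defaults."""
    fov_up = math.radians(fov_up_deg)
    fov_down = math.radians(fov_down_deg)
    image = [[0.0] * cols for _ in range(rows)]
    for x, y, z in points:
        rng = math.sqrt(x * x + y * y + z * z)
        if rng == 0.0:
            continue
        azimuth = math.atan2(y, x)            # [-pi, pi]
        elevation = math.asin(z / rng)        # [-pi/2, pi/2]
        # Column from azimuth, row from elevation within the vertical FOV.
        col = int((0.5 * (azimuth / math.pi + 1.0)) * cols) % cols
        row = int((1.0 - (elevation - fov_down) / (fov_up - fov_down)) * rows)
        row = min(max(row, 0), rows - 1)
        image[row][col] = rng                 # range channel; intensity,
                                              # etc. would be extra channels
    return image
```

Stacking range, intensity, and other per-point attributes as separate channels yields the multi-channel input that a 2D segmentation model such as MaskDINO can consume.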


Evaluating Vision-Language Models for Zero-Shot Detection, Classification, and Association of Motorcycles, Passengers, and Helmets

Choi, Lucas, Greer, Ross

arXiv.org Artificial Intelligence

Motorcycle accidents pose significant risks, particularly when riders and passengers do not wear helmets. This study evaluates the efficacy of an advanced vision-language foundation model, OWLv2, in detecting and classifying various helmet-wearing statuses of motorcycle occupants using video data. We extend the dataset provided by the CVPR AI City Challenge and employ a cascaded model approach for detection and classification tasks, integrating OWLv2 and CNN models. The results highlight the potential of zero-shot learning to address challenges arising from incomplete and biased training datasets, demonstrating the usage of such models in detecting motorcycles, helmet usage, and occupant positions under varied conditions. We have achieved an average precision of 0.5324 for helmet detection and provided precision-recall curves detailing the detection and classification performance. Despite limitations such as low-resolution data and poor visibility, our research shows promising advancements in automated vehicle safety and traffic safety enforcement systems.
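The association step of a cascaded detection pipeline, pairing helmet detections with occupant detections, typically reduces to box-overlap matching. A minimal IoU-based sketch, not the paper's exact method:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def associate(helmet_boxes, occupant_boxes, thresh=0.1):
    """For each helmet box, return the index of the best-overlapping
    occupant box (or None if no overlap exceeds the threshold).
    A greedy sketch of the cascaded association idea."""
    pairs = []
    for h in helmet_boxes:
        best, best_iou = None, thresh
        for i, o in enumerate(occupant_boxes):
            v = iou(h, o)
            if v > best_iou:
                best, best_iou = i, v
        pairs.append(best)
    return pairs
```

With helmets assigned to occupants, each occupant's helmet-wearing status follows directly from whether a helmet box was associated.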


Towards Contactless Elevators with TinyML using CNN-based Person Detection and Keyword Spotting

Pimpalkar, Anway S., Niture, Deeplaxmi V.

arXiv.org Artificial Intelligence

This study presents a proof of concept for a contactless elevator operation system aimed at minimizing human intervention while enhancing safety, intelligence, and efficiency. A microcontroller-based edge device executing tiny Machine Learning (tinyML) inferences is developed for elevator operation. Using person detection and keyword spotting algorithms, the system offers cost-effective and robust units requiring minimal infrastructural changes. The design incorporates preprocessing steps and quantized convolutional neural networks in a multitenant framework to optimize accuracy and response time. Results show a person detection accuracy of 83.34% and keyword spotting efficacy of 80.5%, with an overall latency under 5 seconds, indicating effectiveness in real-world scenarios. Unlike current high-cost and inconsistent contactless technologies, this system leverages tinyML to provide a cost-effective, reliable, and scalable solution, enhancing user safety and operational efficiency without significant infrastructural changes. The study highlights promising results, though further exploration is needed for scalability and integration with existing systems. The demonstrated energy efficiency, simplicity, and safety benefits suggest that tinyML adoption could revolutionize elevator systems, serving as a model for future technological advancements. This technology could significantly impact public health and convenience in multi-floor buildings by reducing physical contact and improving operational efficiency, particularly relevant in the context of pandemics or hygiene concerns.
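The quantized-network side of a tinyML deployment rests on mapping float weights to 8-bit integers. A minimal symmetric per-tensor scheme, shown as a generic illustration rather than the paper's exact toolchain:

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization, as commonly used in
    tinyML toolchains: scale so the largest magnitude maps to 127.
    A simplified illustration, not the study's exact scheme."""
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values and the scale."""
    return [v * scale for v in q]
```

Production flows (e.g. TensorFlow Lite Micro) add zero points, per-channel scales, and calibration over activations, but the size reduction and the quantization error both originate in this basic mapping.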


Leveraging YOLO-World and GPT-4V LMMs for Zero-Shot Person Detection and Action Recognition in Drone Imagery

Limberg, Christian, Gonçalves, Artur, Rigault, Bastien, Prendinger, Helmut

arXiv.org Artificial Intelligence

In this article, we explore the potential of zero-shot Large Multimodal Models (LMMs) in the domain of drone perception. We focus on person detection and action recognition tasks and evaluate two prominent LMMs, namely YOLO-World and GPT-4V(ision) using a publicly available dataset captured from aerial views. Traditional deep learning approaches rely heavily on large and high-quality training datasets. However, in certain robotic settings, acquiring such datasets can be resource-intensive or impractical within a reasonable timeframe. The flexibility of prompt-based Large Multimodal Models (LMMs) and their exceptional generalization capabilities have the potential to revolutionize robotics applications in these scenarios. Our findings suggest that YOLO-World demonstrates good detection performance. GPT-4V struggles with accurately classifying action classes but delivers promising results in filtering out unwanted region proposals and in providing a general description of the scenery. This research represents an initial step in leveraging LMMs for drone perception and establishes a foundation for future investigations in this area.
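The proposal-filtering role reported for GPT-4V can be sketched generically: a detector produces scored boxes, and an LMM query on each crop keeps or discards them. Here `describe_region` is a purely hypothetical stand-in for such a query; no real API is implied:

```python
def filter_proposals(proposals, describe_region, min_score=0.3):
    """Keep detector proposals that an LMM confirms contain a person.

    `proposals` is a list of (box, detector_score) pairs.
    `describe_region` is a hypothetical callable standing in for a
    GPT-4V-style query on an image crop; it returns (label, confidence).
    A sketch of the filtering idea, not either model's actual interface."""
    kept = []
    for box, score in proposals:
        if score < min_score:          # cheap detector-confidence gate first
            continue
        label, confidence = describe_region(box)
        if label == "person":          # LMM acts as a second-stage verifier
            kept.append((box, score * confidence))
    return kept
```

The appeal of the cascade is that the (expensive) LMM call only runs on the few regions the fast open-vocabulary detector already flagged.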


Robots Autonomously Detecting People: A Multimodal Deep Contrastive Learning Method Robust to Intraclass Variations

Fung, Angus, Benhabib, Beno, Nejat, Goldie

arXiv.org Artificial Intelligence

Robotic detection of people in crowded and/or cluttered human-centered environments including hospitals, long-term care, stores and airports is challenging as people can become occluded by other people or objects, and deform due to variations in clothing or pose. There can also be loss of discriminative visual features due to poor lighting. In this paper, we present a novel multimodal person detection architecture to address the mobile robot problem of person detection under intraclass variations. We present a two-stage training approach using 1) a unique pretraining method we define as Temporal Invariant Multimodal Contrastive Learning (TimCLR), and 2) a Multimodal Faster R-CNN (MFRCNN) detector. TimCLR learns person representations that are invariant under intraclass variations through unsupervised learning. Our approach is unique in that it generates image pairs from natural variations within multimodal image sequences, in addition to synthetic data augmentation, and contrasts crossmodal features to transfer invariances between different modalities. These pretrained features are used by the MFRCNN detector for finetuning and person detection from RGB-D images. Extensive experiments validate the performance of our DL architecture in both human-centered crowded and cluttered environments. Results show that our method outperforms existing unimodal and multimodal person detection approaches in terms of detection accuracy in detecting people with body occlusions and pose deformations in different lighting conditions.
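Contrastive pretraining of the TimCLR kind builds on an InfoNCE-style objective: similarity to a positive pair is maximized relative to negatives. A minimal single-anchor version in pure Python, illustrating the general objective rather than TimCLR's exact formulation:

```python
import math

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor: pull the positive embedding close,
    push the negatives away. Vectors are plain lists; cosine similarity
    is used. A generic contrastive objective, not TimCLR's exact loss."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    # Temperature-scaled similarities: positive first, then negatives.
    logits = [cos(anchor, positive) / temperature] + \
             [cos(anchor, n) / temperature for n in negatives]
    # Numerically stable cross-entropy with the positive as the target.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    return -math.log(exps[0] / sum(exps))
```

In a multimodal setting, the anchor and positive would come from different modalities or time steps of the same person, which is what transfers invariances across modalities.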


A Novel Voronoi-based Convolutional Neural Network Framework for Pushing Person Detection in Crowd Videos

Alia, Ahmed, Maree, Mohammed, Chraibi, Mohcine, Seyfried, Armin

arXiv.org Artificial Intelligence

Analyzing the microscopic dynamics of pushing behavior within crowds can offer valuable insights into crowd patterns and interactions. By identifying instances of pushing in crowd videos, a deeper understanding of when, where, and why such behavior occurs can be achieved. This knowledge is crucial to creating more effective crowd management strategies, optimizing crowd flow, and enhancing overall crowd experiences. However, manually identifying pushing behavior at the microscopic level is challenging, and existing automatic approaches cannot detect such microscopic behavior. Thus, this article introduces a novel automatic framework for identifying pushing in videos of crowds at a microscopic level. The framework comprises two main components: i) feature extraction and ii) video labeling. In the feature extraction component, a new Voronoi-based method is developed for determining the local regions associated with each person in the input video. Subsequently, these regions are fed into the EfficientNetV1B0 convolutional neural network to extract the deep features of each person over time. In the second component, a fully connected layer with a Sigmoid activation function is employed to analyze these deep features and annotate the individuals involved in pushing within the video. The framework is trained and evaluated on a new dataset created from six real-world experiments, together with their corresponding ground truths. The experimental findings indicate that the suggested framework outperforms the seven baseline methods employed for comparative analysis.
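The Voronoi-based region extraction can be pictured as assigning every pixel to its nearest person position. A brute-force discrete sketch (the paper's construction may differ):

```python
def voronoi_labels(seeds, width, height):
    """Assign every pixel of a width x height grid to its nearest seed
    point by squared Euclidean distance, producing a discrete Voronoi
    partition. Sketches how per-person local regions could be cut out
    of a frame; not the article's exact construction."""
    labels = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            best, best_d = 0, float("inf")
            for i, (sx, sy) in enumerate(seeds):
                d = (x - sx) ** 2 + (y - sy) ** 2
                if d < best_d:
                    best, best_d = i, d
            labels[y][x] = best
    return labels
```

Cropping each person's labelled region over consecutive frames yields the per-person patch sequences that the CNN then encodes.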